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Abstract 

We consider the problem of estimating sparse graphs by a lasso 
penalty applied to the inverse covariance matrix. Using a coordinate 
descent procedure for the lasso, we develop a simple algorithm that is 
remarkably fast: in the worst cases, it solves a 1000 node problem (~ 
500, 000 parameters) in about a minute, and is 50 to 2000 times faster 
than competing methods. It also provides a conceptual link between 
the exact problem and the approximation suggested by Meinshausen 
& Biihlmann (2006). We illustrate the method on some cell-signaling 
data from proteomics. 



1 Introduction 

In recent years a number of authors have proposed the estimation of sparse 
undirected graphical models through the use of L x (lasso) regularization. 
The basic model for continuous data assumes that the observations have a 
multivariate Gaussian distribution with mean /i and covariance matrix S. If 
the r/'th component of S _1 is zero, then variables % and j are conditionally 
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independent, given the other variables. Thus it makes sense to impose an L 1 
penalty for the estimation of E -1 . 

Meinshausen & Biihlmann (2006) take a simple approach to this problem: 
they estimate a sparse graphical model by fitting a lasso model to each vari- 
able, using the others as predictors. The component E^- 1 is then estimated 
to be non-zero if either the estimated coefficient of variable % on j, or the 
estimated coefficient of variable j on i, is non-zero (alternatively they use an 
AND rule). They show that asymptotically, this consistently estimates the 
set of non-zero elements of X -1 . 

Other authors have proposed algorithms for the exact maximization of 
the L^penalized log-likelihood; both Yuan & Lin (2007) and Banerjee et al. 
(2007) adapt interior point optimization methods for the solution to this 
problem. Both papers also establish that the simpler approach of Mein- 
shausen & Biihlmann (2006) can be viewed as an approximation to the exact 
problem. 

We use the development in Banerjee et al. (2007) as a launching point, 
and propose a simple, lasso-style algorithm for the exact problem. This new 
procedure is extremely simple, and is substantially faster than the interior 
point approach in our tests. It also bridges the "conceptual gap" between 
the Meinshausen & Biihlmann (2006) proposal and the exact problem. 

2 The proposed method 

Suppose we have N multivariate normal observations of dimension p, with 
mean p and covariance E. Following Banerjee et al. (2007), let = S -1 , 
and let S be the empirical covariance matrix, the problem is to maximize the 
log-likelihood 

logdet0-tr(S0) -p||0||i, (1) 

where tr denotes the trace and ||0||i is the L\ norm — the sum of the absolute 
values of the elements of E _1 . Expression (1) is the Gaussian log-likelihood 
of the data, partially maximized with respect to the mean parameter p. 
Yuan & Lin (2007) solve this problem using the interior point method for 
the "maxdet" problem, proposed by Vandenberghe et al. (1998). Banerjee 
et al. (2007) develop a different framework for the optimization, which was 
the impetus for our work. 
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Banerjee et al. (2007) show that the problem (1) is convex and consider 
estimation of £ (rather than £ _1 ), as follows. Let W be the estimate of S. 
They show that one can solve the problem by optimizing over each row and 
corresponding column of W in a block coordinate descent fashion. Partition- 
ing W and S 

w= (Wp w 12 \ s= rSn s 12 \ 

\W{ 2 W 22 J V S 12 S 22/ 

they show that the solution for wu satisfies 

W12 = argmm y {y T W{ 1 1 y : \ \y - si 2 ||oo < p}- (3) 

This is a box-constrained quadratic program which they solve using an inte- 
rior point procedure. Permuting the rows and columns so the target column 
is always the last, they solve a problem like (3) for each column, updating 
their estimate of W after each stage. This is repeated until convergence. Us- 
ing convex duality, Banerjee et al. (2007) go on to show that (3) is equivalent 
to the dual problem 

mi^||W 1 1 1 /2 /3-&|| 2 + p||/3||i, (4) 

where b = W u Si 2 /2. This expression is the basis for our approach. 

First we note that it is easy to verify the equivalence between the solutions 
to (1) and (4) directly. The sub-gradient equation for maximization of the 
log-likelihood (1) is 

w - s - p-r = 0, (5) 

using the fact that the derivative of logdet 6 equals 6 _1 = W, given in e.g 
Boyd & Vandenberghe (2004), page 641. Here T^- G sign(6jj); i.e. = 
sign(e^) if Gij ^ 0, else e [-1, 1] if 6^- = 0. 
Now the upper right block of equation (5) is 

wu - S12 - p • 712 = 0, (6) 

using the same sub-matrix notation as in (2). 

On the other hand, the sub-gradient equation from (4) works out to be 

2Wn/3 -si 2 + p- 1/ = 0, (7) 

where v G sign(/3) element-wise. 



3 



Now suppose (W, T) solves (5), and hence (u>i 2 ,7i 2 ) solves (6). Then 
P = ^W^w^ and v = —712 solves (7). The equivalence of the first two 
terms is obvious. For the sign terms, since W11612 + ^12^22 = 0, we have 
that #12 = —Qi?W\\W\2 (partitioned-inverse formula). Since #22 > 0, then 
sign(0i 2 ) = -signfW^wis) = -sign(/3). 

Now to the main point of this paper. Problem (4) looks like a lasso (Li- 
regularized) least squares problem. In fact if Wu = Su, then the solutions 
(3 are easily seen to equal one-half of the lasso estimates for the pth variable 
on the others, and hence related to the Meinshausen & Buhlmann (2006) 
proposal. As pointed out by Banerjee et al. (2007), Wu 7^ Su in general 
and hence the Meinshausen & Buhlmann (2006) approach does not yield 
the maximum likelihood estimator. They point out that their block-wise 
interior-point procedure is equivalent to recursively solving and updating 
the lasso problem (4), but do not pursue this approach. We do, to great 
advantage, because fast coordinate descent algorithms (Friedman et al. 2007) 
make solution of the lasso problem very attractive. 

In terms of inner products, the usual lasso estimates for the pth variable 
on the others take as input the data Su and s 12 . To solve (4) we instead use 
Wu and si 2 , where Wu is our current estimate of the upper block of W. We 
then update w and cycle through all of the variables until convergence. 

Note that from (5), the solution Wu = su + p for all i, since 9u > 0, and 
hence 1^ = 1. Here is our algorithm in detail: 

Covariance Lasso Algorithm 

1. Start with W = S + pI. The diagonal of W remains unchanged in what 
follows. 

2. For each j = 1, 2, . . .p, 1, 2, . . .p, . . ., solve the lasso problem (4), which 
takes as input the inner products Wu and S12. This gives a p — 1 
vector solution (3. Fill in the corresponding row and column of W 
using w = 2WuP- 

3. Continue until convergence 

Note again that each step in step (2) implies a permutation of the rows 
and columns to make the target column the last. The lasso problem in step 
(2) above can be efficiently solved by coordinate descent (Friedman et al. 
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(2007) ,Wu & Lange (2007)). Here are the details. Letting V = W u , then 
the update has the form 

0. <_ S (s 12j - 2 Vkjh, P)/(2%) (8) 

for j = 1, 2, . . .p, j = 1, 2, . . .p, . . ., where 5 is the soft-threshold operator: 

S{x,t) = sign{x){\x\-t) + . (9) 

We cycle through the predictors until convergence. 

Note that $ will typically be sparse, and so the computation w = 2W n (3 
will be fast: if there are r non-zero elements, it takes rp operations. 

Finally, suppose our final estimate of E is X = W, and store the estimates 
f3 from the above in the rows and columns of a p x p matrix B (note that 
the diagonal of B is not determined). Then we can obtain the pth row (and 
column) of 6 = = W' 1 as follows: 

1 

Wpp -2J2 k ^p BkpW k p 

-2QppB kp ; k ^ p (10) 

Interestingly, if W = S, these are just the formulas for obtaining the inverse 
of a partitioned matrix. That is, if we set W = S and p = in the above 
algorithm, then one sweep through the predictors computes S -1 , using a 
linear regression at each stage. 

3 Timing comparisons 

We simulated Gaussian data from both sparse and dense scenarios, for a 
range of problem sizes p. The sparse scenario is the AR(1) model taken from 
Yuan & Lin (2007): f3a = 1, /^-i = = 0.5, and zero otherwise. In 

the dense scenario, f3u = 2,{3u' = 1 otherwise. We chose the the penalty 
parameter so that the solution had about the actual number of non-zero 
elements in the sparse setting, and about half of total number of elements 
in the dense setting. The convergence threshold was 0.0001. The covariance 
lasso procedure was coded in Fortran, linked to an R language function. All 
timings were carried out on a Intel Xeon 2.80GH processor. 
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V 


Problem 


(1) Covariance 


(2) Approx 


(3) COVSEL 


Ratio of 




Type 


Lasso 






(3) to (1) 


100 


sparse 


.018 


.007 


34.67 


1926.1 


100 


dense 


.038 


.018 


2.17 


57.1 


200 


sparse 


.070 


.027 


> 205.35 


> 2933.6 


200 


dense 


.324 


.146 


16.87 


52.1 


400 


sparse 


.601 


.193 


> 1616.66 


> 2690.0 


400 


dense 


2.47 


.752 


313.04 


126.5 



Table 1: Timings (seconds) for covariance lasso, Meinhausen-Buhlmann ap- 
proximation, and COVSEL procedures. 

We compared the covariance lasso to the COVSEL program provided by 
Banerjee et al. (2007). This is a Matlab program, with a loop that calls a C 
language code to do the box-constrained QP for each column of the solution 
matrix. To be as fair as possible to COVSEL, we only counted the CPU time 
spent in the C program. We set the maximum number of outer iterations to 
30, and following the authors code, set the the duality gap for convergence 
to 0.1. 

The number of CPU seconds for each trial is shown in Table 1. In the 
dense scenarios for p = 200 and 400, COVSEL had not converged by 30 
iterations. We see that the covariance Lasso is 50 to 2000 times faster than 
COVSEL, and only about 3 times slower than the approximate method. Thus 
the covariance lasso is taking only about 3 passes through the the columns 
of W on average. 

Figure 1 shows the number of CPU seconds required for the covariance 
lasso procedure, for problem sizes up to 1000. Even in the dense scenario, it 
solves a 1000 node problem (~ 500, 000 parameters) is about a minute. 

4 Analysis of cell signalling data 

For illustration we analyze a flow cytometry dataset on p = 11 proteins and 
n = 7466 cells, from Sachs et al. (2003). These authors fit a directed acyclic 
graph (DAG) to the data, producing the network in Figure 2. 

The result of applying the covariance Lasso to these data is shown in 
Figure 3, for 12 different values of the penalty parameter p. There is moderate 
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Figure 1: Number of CPU seconds required for the covariance lasso procedure. 




Figure 2: Directed acylic graph from cell- signaling data, from Sachs et al. (2003). 
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L1 norm- 2.061 17 L1 norm- 0.09437 L1 norm- 0.04336 




Erfc AkJ Erfc M§ E% A^ 

L1 norm- 0.00927 L1 norm- 0.00687 L1 norm- 0.00496 




L1 norm- 0.00372 L1 norm- 0.00297 L1 norm- 7e-05 

Figure 3: Cell-signaling data: undirected graphs from covariance lasso with dif- 
ferent values of the penalty parameter p. 
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agreement between, for example, the graph for LI norm = 0.00496 and the 
DAG: the former has about half of the edges and non-edges that appear in 
the DAG. Figure 4 shows the lasso coefficients as a function of total L\ norm 
of the coefficient vector. 

In the left panel of Figure 5 we tried two different kinds of 10-fold cross- 
validation for estimation of the parameter p. In the "Regression" approach, 
we fit the covariance-lasso to nine-tenths of the data, and used the penalized 
regression model for each protein to predict the value of that protein in the 
validation set. We then averaged the squared prediction errors over all 11 
proteins. In the "Likelihood" approach, we again applied the covariance- 
lasso to nine-tenths of the data, and then evaluated the log-likelihood (1) 
over the validation set. The two cross-validation curves indicate that the 
unregularized model is the best, not surprising give the large number of 
observations and relatively small number of parameters. However we also see 
that the likelihood approach is far less variable than the regression method. 

The right panel compares the cross- validated sum of squares of the exact 
covariance lasso approach to the Meinhausen-Buhlmann approximation. For 
lightly regularized models, the exact approach has a clear advantage. 

5 Discussion 

We have presented a simple and fast algorithm for estimation of a sparse 
inverse covariance matrix using an L\ penalty. It cycles through the variables, 
fitting a modified lasso regression to each variable in turn. The individual 
lasso problems are solved by coordinate descent. 

The speed of this new procedure should facilitate the application of sparse 
inverse covariance procedures to large datasets involving thousands of param- 
eters. 

Fortran and R language routines for the proposed methods will be made 
freely available. 
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Figure 4: Cell-signaling data: profile of coefficients as the total L\ norm of the co- 
efficient vector increases, that is, as p decreases. Profiles for the largest coefficients 
are labeled with the corresponding pair of proteins. 
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Figure 5: Cell-signaling data. Left panel shows tenfold cross-validation using both 
Regression and Likelihood approaches (details in text). Right panel compares the re- 
gression sum of squares of the exact covariance lasso approach to the Meinhausen- 
Buhlmann approximation. 
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